Field
The present disclosure generally relates to processing text content in digital images of documents or forms. More specifically, the present disclosure provides techniques for identifying fields and/or labels in a digital image of a form without using optical character recognition (OCR).
Related Art
Forms are often used to collect, register, or record certain types of information about an entity (e.g., a person or a business), a transaction (e.g., a sale), an event (e.g., a birth), a contract (e.g., a rental agreement), or some other matter of interest. A form typically contains fields or sections for specific types of information associated with the subject matter of the form. A field is typically associated with one or more labels identifying the type of information that should be found in the field.
To make information more readily accessible or electronically searchable, individuals, businesses, and governmental agencies often seek to digitize text found on paper forms. Optical character recognition (OCR) techniques are generally used to convert images of text into computer-encoded text. Satisfactory results can typically be achieved when OCR is applied to high-resolution, low-noise images of typed, uniformly black text against a uniformly white background.
Labels and fields generally allow desired information to be located quickly and unambiguously when a form is inspected. Thus, when a paper form is digitized, it can be useful to identify labels and fields within the digitized form. However, several difficulties may arise when OCR is applied to an image of a paper form. First, if the image quality is poor, the text of some labels may be incorrectly interpreted. Second, even if the image quality is high, some labels may be in non-standard fonts or may be formatted unusually. On a certificate, for example, a label such as a title may be in an unusual calligraphic font against a watermark background and may be formatted using effects such as three-dimensional rotation, skewing, shading, shadowing, or reflecting. Such unusually formatted labels may defy computer interpretation by OCR.